Abstract



For a robust leverage diagnostic in linear regression, Rousseeuw and van Zomeren
[1990] proposed using robust distance (Mahalanobis distance computed using robust
estimates of location and covariance). However, a design matrix X that contains coded
categorical predictor variables is often sufficiently sparse that robust estimates of
location and covariance cannot be computed. Specifically, matrices formed by taking
subsets of the rows of X are likely to be singular, causing algorithms that rely on
subsampling to fail. Following the spirit of Maronna and Yohai [2000], we observe that
extreme leverage points are extreme in the continuous predictor variables. We therefore
propose a robust leverage diagnostic that combines a robust analysis of the continuous
predictor variables and the classical definition of leverage.



1 Background 

We consider linear regression models of the form 

$$y_i = x_{i1}^T \beta_1 + x_{i2}^T \beta_2 + x_{i3}^T \beta_3 + e_i \qquad (i = 1, \ldots, n) \tag{1}$$

where $x_{i1} \in \mathbb{R}^{p_1}$ contains coded categorical predictor variables, $x_{i2} \in \mathbb{R}^{p_2}$ contains
continuous predictor variables and the elements of $x_{i3} \in \mathbb{R}^{p_3}$ are each products of at
least one element of $x_{i1}$ and at least one element of $x_{i2}$. Let $X_k$ be the matrix with $i$th
row $x_{ik}^T$ for $k = 1, 2, 3$ so that the design matrix $X = [X_1\ X_2\ X_3]$. The dimension of $X$
is $n \times p$ where $p = p_1 + p_2 + p_3$.

Two classical leverage measures are the diagonal elements of the hat matrix (the hat 
values) 

$$h_i = H_{ii} = x_i^T (X^T X)^{-1} x_i \qquad (i = 1, \ldots, n) \tag{2}$$

where $x_i^T = (x_{i1}^T\ x_{i2}^T\ x_{i3}^T)$ is the $i$th row of $X$, and the Mahalanobis distance (MD)

$$MD_i = \sqrt{(x_i^* - T(X^*))^T C(X^*)^{-1} (x_i^* - T(X^*))} \tag{3}$$

where $T(X^*)$ is the arithmetic mean, $C(X^*)$ is the sample covariance matrix and $X^*$
is identical to $X$ except that the constant column has been removed (if present in $X$).
When X does contain a constant column, these two measures are related by 

$$h_i = \frac{MD_i^2}{n - 1} + \frac{1}{n}. \tag{4}$$
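Relation (4) is easy to check numerically. The following is a minimal sketch in base R on synthetic data; the variable names (`X.star`, `MD2`) and the simulated predictors are assumptions made for illustration and are not part of the original example.

```r
# Synthetic design matrix: a constant column plus two continuous predictors
set.seed(1)
n <- 30
X.star <- cbind(u = rnorm(n), v = rexp(n))   # X without the constant column
X <- cbind(1, X.star)

# Hat values: diagonal of X (X'X)^{-1} X'
h <- diag(X %*% solve(crossprod(X)) %*% t(X))

# Squared Mahalanobis distances using the arithmetic mean and sample covariance
MD2 <- mahalanobis(X.star, colMeans(X.star), var(X.star))

# Relation (4): h_i = MD_i^2 / (n - 1) + 1 / n
all.equal(unname(h), unname(MD2 / (n - 1) + 1 / n))
```

Because the two measures differ only by an affine transformation of the squared distance, either can be used to rank observations by leverage.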



2 Robustification 



Let $\{T^{(rob)}(X_2), C^{(rob)}(X_2)\}$ be a robust estimator of location and covariance where the final
estimate is a weighted mean and a weighted covariance matrix with weights $w =
(w_1, \ldots, w_n)^T$, $w_i \in \{0, 1\}$. The covariance estimator can additionally be rescaled
by a factor $c$. The Fast MCD of Rousseeuw and van Driessen [1999] is one such
estimator. The final robust estimate of location is

$$T^{(rob)}(X_2) = \frac{\sum_{i=1}^n w_i x_{i2}}{\sum_{i=1}^n w_i}$$

and the final robust estimate of covariance is

$$C^{(rob)}(X_2) = \frac{c}{\left(\sum_{i=1}^n w_i\right) - 1} (X_2 - M)^T \operatorname{diag}(w) (X_2 - M)$$

where $M$ is an $n \times p_2$ matrix with rows $T^{(rob)}(X_2)^T$.
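As a rough illustration of these two formulas, the sketch below computes the weighted mean and the rescaled weighted covariance directly and compares them with `cov.wt` from base R. The data, the 0/1 weights and the factor `c.fac` are hypothetical stand-ins chosen for the sketch; in practice the weights and the factor would come from a robust estimator such as the Fast MCD.

```r
set.seed(2)
n <- 25
X2 <- cbind(a = rnorm(n), b = rnorm(n, sd = 3))   # synthetic continuous predictors
w <- rep(1, n)
w[c(3, 11, 20)] <- 0                              # hypothetical 0/1 weights
c.fac <- 1.1                                      # hypothetical rescaling factor c

# Weighted mean and the matrix M whose rows all equal that mean
T.rob <- colSums(w * X2) / sum(w)
M <- matrix(T.rob, n, ncol(X2), byrow = TRUE)

# Rescaled weighted covariance, written as in the displayed formula
C.rob <- c.fac / (sum(w) - 1) * t(X2 - M) %*% diag(w) %*% (X2 - M)

# cov.wt normalizes the weights; for 0/1 weights its unbiased estimate
# divides by (sum(w) - 1), so it should agree up to the factor c
ref <- cov.wt(X2, wt = w)
all.equal(T.rob, ref$center)
all.equal(C.rob, ref$cov * c.fac, check.attributes = FALSE)
```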

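We then observe that the following modification of $X_2$

$$\tilde{X}_2 = \sqrt{\frac{c\,(n - 1)}{\left(\sum_{i=1}^n w_i\right) - 1}}\; W (X_2 - M) + M, \qquad W = \operatorname{diag}(w), \tag{5}$$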

yields 

$$T(\tilde{X}_2) = T^{(rob)}(X_2) \quad \text{and} \quad C(\tilde{X}_2) = C^{(rob)}(X_2). \tag{6}$$
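The step from (5) to (6) can be spelled out as follows. This is a short worked derivation, assuming $W = \operatorname{diag}(w)$ with $w_i \in \{0, 1\}$ (so that $W^2 = W$) and writing $\mathbf{1}$ for the vector of ones and $s$ for the squared scale factor in (5); the symbols $s$ and $\mathbf{1}$ are introduced here only for the derivation. It uses the facts that $M^T \mathbf{1} = n\, T^{(rob)}(X_2)$ and that the weighted deviations from the weighted mean sum to zero, $(X_2 - M)^T W \mathbf{1} = 0$.

```latex
\begin{align*}
s &= \frac{c\,(n-1)}{\left(\sum_{i=1}^n w_i\right) - 1}, \qquad
\tilde{X}_2 - M = \sqrt{s}\, W (X_2 - M), \\
T(\tilde{X}_2) &= \frac{1}{n} \tilde{X}_2^T \mathbf{1}
  = T^{(rob)}(X_2) + \frac{\sqrt{s}}{n} (X_2 - M)^T W \mathbf{1}
  = T^{(rob)}(X_2), \\
C(\tilde{X}_2) &= \frac{1}{n-1} (\tilde{X}_2 - M)^T (\tilde{X}_2 - M)
  = \frac{s}{n-1} (X_2 - M)^T W^2 (X_2 - M) \\
 &= \frac{c}{\left(\sum_{i=1}^n w_i\right) - 1} (X_2 - M)^T \operatorname{diag}(w) (X_2 - M)
  = C^{(rob)}(X_2).
\end{align*}
```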

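Our idea is to form the modified design matrix $\tilde{X} = [X_1\ \tilde{X}_2\ \tilde{X}_3]$ where $\tilde{X}_3$ is formed as
$X_3$ but using the values in $\tilde{X}_2$ in place of those in $X_2$. We then define the robust hat
value to be

$$h_i^{(rob)} = x_i^T (\tilde{X}^T \tilde{X})^{-1} x_i \qquad (i = 1, \ldots, n) \tag{7}$$

and the robust distance to be

$$RD_i = \sqrt{(x_i^* - T(\tilde{X}^*))^T C(\tilde{X}^*)^{-1} (x_i^* - T(\tilde{X}^*))}. \tag{8}$$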
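The construction in (5), (7) and (8) can be sketched end to end in base R. Everything below is an illustrative assumption rather than the procedure's reference implementation: the data are simulated, the helper `design()` is hypothetical, and the 0/1 weights are set by hand (flagging one planted extreme point) instead of being produced by a robust estimator.

```r
set.seed(3)
n <- 40
g  <- rep(0:1, each = n / 2)                 # coded two-level factor
X2 <- cbind(u = rnorm(n), v = rnorm(n))      # continuous predictors
X2[1, ] <- c(8, -8)                          # one planted extreme point

# Hypothetical helper: design matrix [X1 X2 X3] with X3 = g * u
design <- function(X2) cbind(1, g, X2, gu = g * X2[, "u"])

w <- rep(1, n)
w[1] <- 0                                    # stand-in for robust 0/1 weights
c.fac <- 1                                   # no rescaling in this sketch

# Equation (5): center, rescale the rows by the weights, add the center back
TX2 <- colSums(w * X2) / sum(w)
X2.tilde <- sweep(X2, 2, TX2)
X2.tilde <- sqrt(c.fac * (n - 1) / (sum(w) - 1) * w) * X2.tilde
X2.tilde <- sweep(X2.tilde, 2, TX2, FUN = "+")

X <- design(X2)
X.tilde <- design(X2.tilde)

# Equation (7): original rows, modified cross-product matrix
h.rob <- diag(X %*% solve(crossprod(X.tilde)) %*% t(X))

# Equation (8): distances of the original rows from the mean and
# covariance of the modified matrix (constant column removed)
Xs <- X[, -1]
Xs.tilde <- X.tilde[, -1]
RD <- sqrt(mahalanobis(Xs, colMeans(Xs.tilde), var(Xs.tilde)))

# With a constant column, the two quantities are linked as in relation (4)
all.equal(unname(h.rob), unname(RD^2 / (n - 1) + 1 / n))
```

Unlike classical hat values, the robust hat values are not bounded above by 1, since the rows of $X$ are evaluated against a cross-product matrix built from $\tilde{X}$.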

3 Discussion



When the linear regression model contains only an intercept term and continuous
predictor variables, $\tilde{X}^* = \tilde{X}_2$, $T(\tilde{X}^*) = T^{(rob)}(X_2)$ and $C(\tilde{X}^*) = C^{(rob)}(X_2)$ so
that the quantity defined in equation [8] is equivalent to the robust distance given in
Rousseeuw and van Zomeren [1990]. Hence, we call this quantity robust distance as
well.

When $p_1 > 1$ (i.e., when there are coded categorical predictor variables), the robust
distances in equation [8] are appropriate as a leverage diagnostic but not (in the author's 
opinion) as a distance measure in a multivariate setting. Therefore we recommend that 
software report the leverage diagnostic on the scale of the hat values. 



4 Example 



We turn to the epilepsy data published in Thall and Vail [1990] for an example.

> require(robustbase)
> data(epilepsy)

First make the design matrix.

> X <- model.matrix(~ Age10 + Base4 * Trt, data = epilepsy)
> n <- nrow(X)
> head(X)
  (Intercept) Age10 Base4 Trtprogabide Base4:Trtprogabide
1           1   3.1  2.75            0                  0
2           1   3.0  2.75            0                  0
3           1   2.5  1.50            0                  0
4           1   3.6  2.00            0                  0
5           1   2.2 16.50            0                  0
6           1   2.9  6.75            0                  0

In this case we have

> X1 <- X[, c(1, 4)]
> head(X1)
  (Intercept) Trtprogabide
1           1            0
2           1            0
3           1            0
4           1            0
5           1            0
6           1            0

> X2 <- X[, 2:3]
> head(X2)
  Age10 Base4
1   3.1  2.75
2   3.0  2.75
3   2.5  1.50
4   3.6  2.00
5   2.2 16.50
6   2.9  6.75

> X3 <- X[, 5, drop = FALSE]
> head(X3)
  Base4:Trtprogabide
1                  0
2                  0
3                  0
4                  0
5                  0
6                  0

> mcd <- covMcd(X2)
> w <- mcd$raw.weights
> mcd$cov
           Age10      Base4
Age10  0.7463740 -0.3267283
Base4 -0.3267283 10.0194113

The implementation of the Fast MCD in the robustbase package rescales the final
covariance matrix estimate by a consistency correction factor mcd$cnp[1] and a small
sample correction factor mcd$cnp[2] so that c = prod(mcd$cnp).

> cov.wt(X2, wt = w)$cov * prod(mcd$cnp)
           Age10      Base4
Age10  0.7463740 -0.3267283
Base4 -0.3267283 10.0194113

> TX2 <- apply(X2, 2, weighted.mean, w = w)



Compute $\tilde{X}_2$ by applying equation [5] to $X_2$.

> X2.tilde <- sweep(X2, 2, TX2)
> X2.tilde <- sqrt(prod(mcd$cnp) * (n - 1)/(sum(w) - 1) * w) * X2.tilde
> X2.tilde <- sweep(X2.tilde, 2, TX2, FUN = "+")

Verify that $C(\tilde{X}_2) = C^{(rob)}(X_2)$.

> var(X2.tilde)
           Age10      Base4
Age10  0.7463740 -0.3267283
Base4 -0.3267283 10.0194113



We can obtain the modified data (not in general but for this example) by replacing X2
in the original data and recomputing the design matrix.

> epilepsy[dimnames(X2)[[2]]] <- X2
> X.tilde <- model.matrix(~ Age10 + Base4 * Trt, data = epilepsy)
> head(X.tilde)
  (Intercept) Age10 Base4 Trtprogabide Base4:Trtprogabide
1           1   3.1  2.75            0                  0
2           1   3.0  2.75            0                  0
3           1   2.5  1.50            0                  0
4           1   3.6  2.00            0                  0
5           1   2.2 16.50            0                  0
6           1   2.9  6.75            0                  0



The final robust leverage measure is then given by the diagonal elements of the matrix
$X (\tilde{X}^T \tilde{X})^{-1} X^T$.

> diag(X %*% solve(t(X.tilde) %*% X.tilde) %*% t(X))
         1          2          3          4          5          6          7
0.05918398 0.05761964 0.07597885 0.08831037 0.12814167 0.03649363 0.05707197
         8          9         10         11         12         13         14
0.13821982 0.06977150 0.05953140 0.08231479 0.04790064 0.06109578 0.06518841
        15         16         17         18         19         20         21
0.21304208 0.06047114 0.04498471 0.38633944 0.04914452 0.07172279 0.05490496
        22         23         24         25         26         27         28
0.09056742 0.05061124 0.04363259 0.06789648 0.12056569 0.10505741 0.07403980
        29         30         31         32         33         34         35
0.13316337 0.04489245 0.07575642 0.05223374 0.09433237 0.04382864 0.03457940
        36         37         38         39         40         41         42
0.06124138 0.05326251 0.09628077 0.04761239 0.05961493 0.05079567 0.10109938
        43         44         45         46         47         48         49
0.06090713 0.05230413 0.06278511 0.06904524 0.03396855 0.05985715 0.64794379
        50         51         52         53         54         55         56
0.04181870 0.03780989 0.05743717 0.06796775 0.11009718 0.04673072 0.03927901
        57         58         59
0.05935622 0.06818611 0.07601004



References 

Ricardo A. Maronna and Victor J. Yohai. Robust regression with both continuous
and categorical predictors. Journal of Statistical Planning and Inference,
89(1-2):197-214, 2000. ISSN 0378-3758. doi: 10.1016/S0378-3758(99)00208-6. URL
http://www.sciencedirect.com/science/article/pii/S0378375899002086.

Peter J. Rousseeuw and Katrien van Driessen. A fast algorithm for
the minimum covariance determinant estimator. Technometrics,
41(3):212-223, 1999. doi: 10.1080/00401706.1999.10485670. URL
http://amstat.tandfonline.com/doi/abs/10.1080/00401706.1999.10485670.

Peter J. Rousseeuw and Bert C. van Zomeren. Unmasking multivariate outliers and
leverage points. Journal of the American Statistical Association, 85(411):633-639,
1990. ISSN 01621459. URL http://www.jstor.org/stable/2289995.



P.F. Thall and S.C. Vail. Some covariance models for longitudinal count data with 
overdispersion. Biometrics, pages 657-671, 1990. 






